The validity and dependability of functional
competency tests for adults are examined as they relate to the
information needs of instructional decision makers. Test data from
the Adult Performance Level (APL) Program (funded by the U.S. Office
of Education at the University of Texas at Austin) is used to
illustrate key points. In the discussion of validity, the importance
of a test's demonstrated relevance to functional competency is
discussed in terms of the definitions of the competency. Issues of
content vs. criterion validity are examined, particularly with
reference to the APL study. Some of the problems inherent in setting
and applying cutoffs (points on a scale of scores which define levels
of competence) are then discussed, and the author reviews several
procedures to aid in setting and adjusting cutoffs (those used by
Nedelsky and by Emrick, and Bayesian techniques used by Northcutt).
In the discussion of dependability (the degree to which scores are
replicable) the author briefly reviews the work of Bob Brennan and
Mike Kane (based on that of Cronbach and others) in the area of
defining and assessing psychometric properties of
criterion-referenced tests. In conclusion it is pointed out that the
instructional decision maker may raise or lower a cutoff as
information justifies such action but that there will be instances in
which trade-offs between dependability and validity may become
necessary. (JT)

MAKING DECISIONS ABOUT ADULT LEARNERS BASED ON
PERFORMANCES ON FUNCTIONAL COMPETENCY MEASURES



Adult Basic Education (ABE) has long concerned itself with 
those individuals whose ability to function within society is at 
a marginal level. A symptom of the condition of marginal
functioning has always been either illiteracy or functional
illiteracy. Currently the phrase "functional competency" is perhaps
more comprehensive. Adult educators have risen to the challenge of
educating adults to be functionally competent, and the concept of 
functional competency has gained national recognition. 

The Economic Opportunity Act of 1964 (PL 88-452, Title IIB) 
and the Adult Education Act of 1966 (PL 89-750, Title III) have 
focused national attention on the functional competency needs of 
adults. A national "Right to Read" Adult Movement (sponsored by 
the U.S. Department of Health, Education, and Welfare) adopted 
the following policy statement in 1970: 

The challenge is to foster through every means 
the ability to read, write, and compute with the 
functional competence needed for meeting the 
requirements of adult living.

This focus on functional competencies, on "coping skills",
eventually led to the U.S. Office of Education funded Adult
Performance Level study at the University of Texas at Austin.
The purpose of the study was twofold: to specify the competencies
required for functioning in society, and to develop devices for 
assessing those competencies. The underlying assumptions were, 
of course, that definable competencies did exist and that they 
could be measured.

Functional competency, when operationally defined in terms
of specific tests, typically implies that there is a cutoff point
or set of cutoff points which define levels of competence. In
the case of one cutoff point, those persons scoring at or above
the cutoff are considered competent, while those scoring below
are not. In the case of two or more cutoff points, individuals
are placed into categories as a result of their scores in relation
to the various cutoffs.

The decision maker is immediately faced with two questions; 
these concern the validity of the test and the degree to which 
scores are replicable. For the purpose of this paper, these con- 
cerns will be referred to as validity and dependability. The 
remainder of this paper will be devoted to the issues of validity 
and dependability of measurement as they relate to information 
needs of instructional decision makers. Test data from the Adult 
Performance Level Program (ACT, 1976, 1977) will be used to 
illustrate key points.

Validity

Decision makers may place several requirements on tests of 
functional competency. These tests must, above all, have some 
demonstrated relevance to functional competency, as defined in 
a way acceptable to the decision maker. Thus, for example, if 
functional competency is defined in terms of social and economic
success (and if this definition is acceptable to the decision
maker) then tests of functional competency must demonstrate a
positive correlation with measures of social and economic success
in order to be considered valid (i.e., to possess criterion
validity). If, on the other hand, competency is defined strictly
in terms of mastery of a specified set of objectives then the
validity of functional competency tests rests in the judged
relevance of individual items to the several objectives (content
validity). In any event, the operational definition used in the
construction of a competency measure (and the definition may very
well suggest both content and criterion validity) will dictate
validation procedures to a certain extent. Whether the decision
maker uses a locally constructed measure or a nationally
standardized one, the relationship between the acceptable definition
of competency and the available validity data should be examined
carefully.

Nafziger, Thompson, Hiscox, and Owen (1975) reviewed several 
measures of what they termed "functional literacy" (for all 
practical purposes very similar to functional competency but less 
comprehensive). Of the four criterion referenced tests reviewed, 
all were rated as good with respect to content or construct 
validity and fair to poor with respect to criterion validity. 
Overall, the validity of each measure (including the 42 item 
Texas APL Survey) was rated as fair. It is clear, however, that 
the developers of the four tests concentrated on content validity, 
while the definition accepted by Nafziger et al. included both
content and criterion validity.

The definition of functional competency developed by the
University of Texas APL research team (Northcutt, Selz, Shelton,
& Nyer, 1975) stated that: 1) the term functional competency is
meaningful only in a specific societal context; 2) functional
competency is best described as the application of a set of
skills to a set of general knowledge areas; 3) functional 
competency results from a combination of individual capabilities 
and societal requirements; and 4) functional competency is 
directly related to success in adult life. Points (2) and (4) 
of the definition may be viewed as dictating content and criterion 
validation procedures. Yet Northcutt et al. seemed to concentrate
on point (2), in terms of validity information, in their final
report. This emphasis is reflected in the fact that Nafziger et
al. rated the APL Survey very highly in terms of content validity
and very poorly in terms of criterion validity.

A criticism on similar grounds was later voiced by Griffith
and Cervero (1977). They argued that both the original University
of Texas APL researchers and American College Testing Program APL
staff had devoted too little attention to criterion validity. More
recently, Cervero has provided some criterion validity information
regarding the APL Survey. In a reanalysis of original APL Survey
data, Cervero found significant correlations between Texas developed
APL Survey scores and measures of success. These were .56 for
years of schooling, .55 for occupational status, and .39 for family
income. All correlations were based on 5,000 to 8,000 responses
and significant beyond the .001 level. According to Cervero (p. 4),
"Since the correlations between APL test score and indicators of
'success' are about as good as would be expected, it could be argued
that the APL test is directly related to 'success' in adult life,
as the developers assume".

Correlations between APL Content Area Measures and adult
success criterion variables were not as high as those found for
the original APL Survey. These correlations, reported in the
APL Content Area Measure Technical Supplement (ACT, 1977f),
ranged from .09 to .19 for family income (median r = .15) and
from .19 to .21 for years of education (median r = .20). All
correlations were based on 650 to 1,100 responses. Although all
were significant, they were less than one might expect, given
previous findings (e.g., Jencks et al., 1972).

Performance on APL Content Area Measures is understandably
interpreted in terms of instructional goals. Whereas levels on
the original APL Survey (Northcutt et al., 1975) were couched
in terms of likelihood of success in adult life, ACT level
definitions are as follows:

Level 1 - Has an inadequate degree of competency - a
          definite need for study and remediation to meet
          the APL goals and objectives through the
          application of basic skills.

Level 2 - Has a marginal degree of competency - a need
          for study and review to meet the APL goals and
          objectives through the application of basic skills.

Level 3 - Has an adequate degree of competency - may need
          some review to continue to meet the APL goals and
          objectives through the application of basic skills.

Given these definitions, the instructional decision maker
has no basis for relating test performance directly to likelihood
of success in life. Learners are evaluated strictly in terms
of objective mastery.

A question which immediately arises when adjectives such 
as "inadequate", "marginal", or "adequate" are used, no matter 
what the context, is "By what criterion?" That is, what is the 
standard by which these labels are attached to individual per- 
formances? There is a score, for example, below which performance 
is judged to be inadequate and above which performance is judged 
to be adequate (or marginal). The process by which these scores 
are established is of crucial importance. Analysis of this process 
is no less important than an analysis of the content or criterion 
validity of the test because effects of the process on the learner 
are no less profound than those of test validity. 

Greater attention will be paid to the setting of cutoffs 
within the section on dependability but it seems important to 
outline here some of the problems inherent in setting cutoffs 
and some of the related problems faced by instructional decision 
makers. It is perhaps little consolation to find that these
problems are not unique to the field of functional literacy/
competency. They are simply a little more acute because of
the current visibility of functional competency.

It is typically the case that criteria or cutoff scores
are set more or less arbitrarily. This is true even of many
nationally published tests which have cutoffs. An excellent
review of some of the procedures by which cutoffs may be set
more objectively may be found in an article by John Meskauskas
(1976). Although there is a certain degree of arbitrariness
in all procedures reviewed, elements of objectivity are
introduced which have the effect of reducing arbitrariness, to
varying degrees, in each of the methods. Two procedures may
serve as illustration, although others are certainly possible
and defensible.

The Minimum Pass Level (MPL) developed by Nedelsky (1954) 
utilizes the judgements of several persons who rate individual 
items with respect to difficulty. Let us assume that seven 
instructors (A through G) each rate one hundred test items (1 
through 100). Instructor A looks at item 1 and predicts the 
chances of the hypothetically lowest passing learner (i.e.,
the least competent of the competent) for answering the item
correctly. Instructor A then does the same with items 2 through
100 and adds the probabilities to get an MPL. Instructors B through 
G do the same. One can then express the minimum passing level 
(MPL) as follows:

$$\mathrm{MPL} = M_{FD} + K\sigma_{FD}, \tag{1}$$

where $M_{FD}$ is the mean of the individual instructor MPLs and
$\sigma_{FD}$ is the standard deviation of the distribution of
individual MPLs. FD refers to a cutoff between grades of F and D.
For tests such as the APL Survey or Content Area Measures, one might
just as easily focus on the cutoff separating levels 1 and 2 and on
the cutoff separating levels 2 and 3. K is a constant which may be
adjusted to control the percentage of marginal students who "pass"
the test. The essential subjective elements are the individual
predictions of learner success on given items and the setting of
the value of K. This method does have some advantages over a
totally ad hoc approach in that it does focus on individual items
and forces some structure onto the process. Ebel (1972) has
developed a similar procedure which essentially extends Nedelsky's
model into two dimensions (relevance and difficulty).
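
As a rough illustration of the arithmetic, the sketch below sums each
instructor's item probabilities into an individual MPL and then combines
the individual MPLs as the mean plus K standard deviations, which is the
form the text describes. The function names, the three-instructor
ratings, and the use of the population standard deviation are
assumptions made only for this example; they are not taken from
Nedelsky (1954).

```python
from statistics import mean, pstdev


def instructor_mpl(item_probabilities):
    """One instructor's MPL: the sum of predicted success probabilities
    for the hypothetical lowest passing learner, one value per item."""
    return sum(item_probabilities)


def minimum_pass_level(ratings, k=0.0):
    """Combine individual MPLs as mean + k * standard deviation.

    ratings: one list of item probabilities per instructor.
    k: the adjustable constant K from the text.
    The population standard deviation is an assumption of this sketch.
    """
    mpls = [instructor_mpl(r) for r in ratings]
    return mean(mpls) + k * pstdev(mpls)


# Hypothetical ratings: three instructors, four items each.
ratings = [
    [0.50, 0.50, 1.00, 0.25],
    [0.50, 0.50, 0.50, 0.50],
    [1.00, 0.50, 0.50, 0.25],
]
cutoff_k0 = minimum_pass_level(ratings, k=0.0)
cutoff_k1 = minimum_pass_level(ratings, k=1.0)
print(round(cutoff_k0, 2), round(cutoff_k1, 2))
```

Raising K moves the cutoff upward, so fewer marginal learners pass;
lowering it has the opposite effect.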

A procedure attributed to Emrick (1971) draws upon decision
theory in that the test designer or administrator must express
certain subjective factors upon which he or she bases decisions.
Although the procedure treats competency as an all or none trait
(i.e., there is no underlying continuum of mastery; a learner
has either mastered or failed to master a given curriculum), it
may be viewed as helpful in setting cutoffs because it relates
test performance to performance in other areas and is best applied
at the subtest level (i.e., units of about ten items). The
decision maker is forced to make a statement about how bad
different kinds of errors of classification would be. Let us
call the erroneous placement of a non-master into the master
category (on the basis of a response to any given item) a Type 1
error (false positive) and the converse error a Type 2 error
(false negative). The probability of making a Type 1 error will
be expressed as $\alpha$, while the probability of a Type 2 error will
be expressed as $\beta$. Now the decision maker must express in a ratio
the relative losses associated with these two types of errors.
Emrick (1971) calls this the ratio of regret (RR). This ratio
is purely subjective unless, of course, real costs may be determined
for each type of loss. The optimal cutting score (C) may be
expressed in terms of test length (n) and these other factors
as follows:

$$C = \frac{\log\dfrac{\beta}{1-\alpha} + \dfrac{1}{n}\log RR}{\log\dfrac{\alpha\beta}{(1-\alpha)(1-\beta)}} \tag{2}$$

Information about learners accumulated over a period of time
may provide empirical estimates of $\alpha$ and $\beta$ in equation (2). If,
for example, it is discovered that five percent of those learners
who answer certain items correctly have not actually mastered the
content, then $\alpha$ = .05. If, on the other hand, ten percent of
learners who respond incorrectly to certain items are actually
masters, then $\beta$ = .10. Assuming now that the two types of errors
are equally serious, RR would equal 1.0. Thus, for a 10 item
subtest equation (2) would yield C = .44, or a cutting score of
4.4 items, which could be rounded off to 4 or 5. The cutting
scores for the subtests of a whole test could be added together to
yield a total test cutoff. In the special case where Type 1 and
Type 2 errors have an equal probability of occurring ($\alpha = \beta$),
and both types are considered equally serious (RR = 1.0), it can be
shown that the cutoff score will always be exactly half the total
number of items.
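
The worked example can be checked numerically. The sketch below assumes
that the cutting score formula has the form given in the docstring (a
reading that reproduces the 4.4 items quoted in the text) and that C is
a proportion of items correct; the function name `emrick_cutoff` and its
argument names are hypothetical.

```python
import math


def emrick_cutoff(alpha, beta, rr=1.0, n=10):
    """Proportion-correct cutoff C for an n-item subtest.

    Assumed form of equation (2):
        C = [log(beta / (1 - alpha)) + (1 / n) * log(rr)]
            / log(alpha * beta / ((1 - alpha) * (1 - beta)))
    alpha: probability of a Type 1 (false positive) error.
    beta:  probability of a Type 2 (false negative) error.
    rr:    ratio of regret.
    """
    if math.isclose(alpha + beta, 1.0):
        raise ValueError("C is undefined when alpha + beta = 1")
    numerator = math.log10(beta / (1 - alpha)) + math.log10(rr) / n
    denominator = math.log10(alpha * beta / ((1 - alpha) * (1 - beta)))
    return numerator / denominator


# Worked example from the text: alpha = .05, beta = .10, RR = 1.0, n = 10.
c = emrick_cutoff(0.05, 0.10, rr=1.0, n=10)
items_needed = c * 10  # cutoff expressed as a number of items
print(round(c, 2), round(items_needed, 1))  # 0.44 4.4
```

The base of the logarithm does not matter here, because the same base
appears in the numerator and the denominator and cancels in the ratio.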

Of course, it will not always be the case that all things will
be equal, and the cutoff will have to be set at some point other
than 0.5. Figures 1 through 3 are provided to show what happens
to C as each of the parameters changes. As can be seen from
Figure 1, the value of C levels off very quickly as RR increases
for the given values of $\alpha$ and $\beta$. In other words, the value of
the most subjective parameter of equation (2) seems to have
little impact on C for these data. Although the largest value
of RR is 100 times as great as the smallest value, the range of C
is only .09 (i.e., from .39 to .48).

Figure 1. Cutoff (C) as a function of ratio of regret (RR) with
values of $\alpha$ and $\beta$ fixed at .05 and .10, respectively.

Figure 2. Cutoff (C) as a function of probability of false
positive error ($\alpha$) with values of $\beta$ and RR fixed
at .10 and 1.0, respectively.

Figure 3. Cutoff (C) as a function of probability of false
negative error ($\beta$) with values of $\alpha$ and RR fixed
at .05 and 1.0, respectively.

On the other hand, values of C change rather dramatically as
either $\alpha$ or $\beta$ increases. In Figure 2 the value of C ranges from
.34 to .62 while $\alpha$ goes from .01 to .30. The range of C is thus
three times that of C in Figure 1. Likewise, in Figure 3, the
range of C is from .32 to .60 or about three times the range of
C in Figure 1. Also note that as $\alpha$ increases, C increases, while
C decreases as either $\beta$ or RR increases. As the likelihood of
classifying non-masters as masters increases, one is forced to
raise the cutoff. As the likelihood of classifying masters as
non-masters increases, one is forced to lower the cutoff.
Similarly, if the second type of misclassification is considered to
be a more serious mistake (larger regret) than a misclassification
of the first type (smaller regret), then it will be necessary to
lower the cutoff. Although other values for each of the three
parameters could have been chosen, these are representative of
likely values one might obtain empirically. Other sets of parameters
might yield very different kinds of curves. In fact, for some
values of $\alpha$ and $\beta$, C will be undefined, for example, when
$\alpha + \beta = 1.0$, or when all examinees are misclassified. Under
such conditions, the decision maker is well advised to choose
an alternative method for establishing cutoffs.
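
The two special cases mentioned in the text can be checked directly.
The short derivation below assumes that the cutting score formula has
the form shown in its first line (a reading consistent with the worked
example of 4.4 items); the first case also assumes $\alpha \neq .5$, so
that the logarithm being cancelled is not zero.

```latex
% Assumed form of equation (2)
C = \frac{\log\frac{\beta}{1-\alpha} + \frac{1}{n}\log RR}
         {\log\frac{\alpha\beta}{(1-\alpha)(1-\beta)}}

% Case 1: alpha = beta and RR = 1, so log RR = 0
C = \frac{\log\frac{\alpha}{1-\alpha}}
         {\log\frac{\alpha^{2}}{(1-\alpha)^{2}}}
  = \frac{\log\frac{\alpha}{1-\alpha}}
         {2\log\frac{\alpha}{1-\alpha}}
  = \frac{1}{2}

% Case 2: alpha + beta = 1, so 1 - alpha = beta and 1 - beta = alpha
\log\frac{\alpha\beta}{(1-\alpha)(1-\beta)}
  = \log\frac{\alpha\beta}{\beta\alpha}
  = \log 1 = 0
```

In the second case the denominator is zero, so no value of C can be
computed, whatever values are chosen for RR and n.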

The point of this admittedly rather lengthy discourse is 
this: the setting of cutoffs on functional competency measures
need not be completely arbitrary. In fact, because behavioral
manifestations of competency will vary from place to place, it
is advisable to consider setting one's own population specific
cutoff. The instructional decision maker can and should maintain
a constant surveillance over the effects of cutoffs on placement
and subsequent performance of learners and adjust as he or she
sees need to do so. This adjustment becomes easier if the criterion
is something with which the decision maker is quite familiar, such
as curriculum objectives. This adjustment becomes more difficult
if the criterion is something with which the decision maker is
less familiar, such as the actual life success of individual
learners. For this reason, as well as for other reasons, it would
seem more appropriate for adult educators to concentrate on
curriculum objectives rather than on global indicators of life
success. While several procedures are available to aid in setting
cutoffs, the decision maker should rely on the method which
matches his or her definition of competency and characteristics
of the program and learners.

A procedure unlike either of the two just described (viz.,
Nedelsky, 1954; Emrick, 1971) was used by Northcutt (1974) to set
cutoffs on the APL Survey. In his procedure, Northcutt used
Bayesian techniques (see, for example, Novick, 1973, for a review
of Bayesian applications). First, he obtained a rough consensus
regarding the operational definition of adult success. Next,
the Opinion Research Corporation was employed to conduct a
nationwide survey of a representative sample of adults to estimate
the percentages of adults classified at each success level. This
same sample was also given the first version of the APL Survey.
It was found that items could discriminate among the three groups
of adults (with respect to life success). The test score related
level classifications which ultimately emerged took into account
this discriminating power of items. The process underwent several
refinements before the final cutoffs were set. By this process,
it was estimated that roughly 20% of the adult population of the
United States were functionally incompetent (Level 1), 34% were
marginally competent (Level 2), and 46% were proficient (Level 3).

More recently, Jerry Williams set cutoffs on an APL test by
comparing the performances of various groups of adults on the test.
These various subgroups were aggregated into two major groups,
productive and marginally productive. The productive group
contained professionals, machinists, craftsmen, sales workers,
farmers, and so on. The marginally productive group consisted
of prison inmates, unemployed, and persons for whom English was
not a native tongue (but who were receiving English instruction).
By comparing the median scores for all groups, Williams found a
fairly clean break at about 70%. This percentage was taken as a
rough estimate of a desired level of performance. The actual
cutoff used was moderated by a procedure similar to Emrick's (1971)
such that the actual cutoff was .60.

The examples just given show the relationship between test 
validity and setting of cutoffs. In one case (Nedelsky, 1954) the 
setting of a cutoff was related more or less to content validity. 
In the other cases, cutoffs were more clearly related to criterion 
validity. The key issue here is that the subjectivity of
classification of learners may be greatly reduced through a modicum of
effort. Given the context of validity based cutoffs (which need
not be elaborately worked out), the instructor of adult learners
may render very defensible, data based judgements.

Dependability of Measurement

A specific implication of functional competency testing is
that adults are not ranked in order of score but rather that each
person's score is compared to a predetermined cut-off or set of
cut-offs. Thus, functional competency testing is typically
outside the realm of norm-referenced testing and well within
the realm of criterion, or domain referenced testing.

Most of test theory, as we know it today, has been developed 
around the concept of ranking individuals along some continuum. 
The concept of cut-off, or minimum level of performance, had never
been very important. Within the past two decades, however, this
concept has become very important. The individualized instruction
movement of the late 1940's and beyond raised many technical
questions, including a number related to testing. These questions 
were addressed by several researchers from about 1960 to the 
present. Most of the research focused on individual items; how 
to construct them, how to select them, etc. A few researchers 
concentrated on assessing the characteristics of decision making 
procedures, which included total test qualities as well as the 
setting of cutoffs. 

The most promising work in the area of defining and assessing 
psychometric properties of criterion-referenced tests has been
done by Bob Brennan and Mike Kane (Brennan, 1977a, 1977b; Brennan
& Kane, 1977, in press; Kane & Brennan, 1977). Their work stems
directly from that of Cronbach, Gleser, Nanda, & Rajaratnam (1972).
Whereas the work of Cronbach et al. concentrated on norm-referenced
tests, Brennan and Kane have focused on criterion or domain-
referenced tests. One difference in the two approaches lies in
the fact that Brennan and Kane allow for cut-off scores.

While I will attempt to summarize these works here enough
to shed some light on the remainder of the paper, this review is
by no means exhaustive or comprehensive. Those interested are
directed especially to the book by Cronbach et al. (1972) and
the article by Brennan & Kane (1977). Following this review, I
shall present data from the development of the APL Content Area
Measures (ACT, 1977a-f) which illustrate uses of dependability/
generalizability theory. I shall also attempt to demonstrate the 
applicability of such procedures to local decision-making processes 
involving adult learners and measures of functional competency. 

Cronbach et al. (1972) suggested a liberalization of test
theory to take into account more than two facets in the determination
of the reliability of measures. This liberalization has come to
be known as generalizability theory, as opposed to classical test
theory. While classical test theory treats reliability as the
ratio of two variances (cf. Guilford, 1954; Lord & Novick, 1968),
this approach considers only two types of variance, namely true
score and error. In classical terms, observed score variance,
$\sigma^2(X)$, is viewed as divisible into two components as defined in
the following equation:

$$\sigma^2(X) = \sigma^2(T) + \sigma^2(e), \tag{3}$$

where $\sigma^2(T)$ is true score variance, and $\sigma^2(e)$ is error variance.
In this context reliability (r) is expressed as a ratio:

$$r = \frac{\sigma^2(T)}{\sigma^2(T) + \sigma^2(e)}. \tag{4}$$

In the most straightforward case, a group of examinees is
given a set of items, and this process is called a test
administration. In this simplest case, at least three definable things
or components enter into total score variance. These are the
items, the examinees, and error. In the terminology of Cronbach
et al. the observed score of examinee p on item i ($X_{pi}$) may be
expressed as

$$X_{pi} = \mu + \pi_p + \beta_i + \pi\beta_{pi} + e, \tag{5}$$

where $\mu$ is the grand mean across persons and items; $\pi_p$ is the
effect due to person p; $\beta_i$ is the effect due to item i;
$\pi\beta_{pi}$ is the effect due to the interaction of person p and
item i; and e is experimental error. Since person p only takes item i
once, it is not possible in this situation to estimate the interaction
effect separately from error. Therefore, the effects $\pi\beta_{pi}$ and e
are lumped together in a common error term. Thus,

$$X_{pi} = \mu + \pi_p + \beta_i + (\pi\beta, e)_{pi}, \tag{6}$$

where $(\pi\beta, e)_{pi}$ is the common error term, and all other terms are as
defined in equation (5).

Reliability, within the context of generalizability theory,
is also expressed in terms of variances or variance components.
However, before entering into a discussion of these components of
variance, it will be necessary to discuss two contexts within
which variance components are computed. These contexts are
generalizability studies and decision studies.

Cronbach et al. (1972) distinguish between generalizability
studies, or G-studies, and decision studies, or D-studies. In
a G-study, one is typically interested primarily in a theoretically
infinite population of examinees and universe of items. In a
D-study, one is typically interested in a more narrowly defined
group of examinees and/or items. A test publisher may, for
example, administer a new test to a nationally selected group of
examinees. The intent of this administration may be to accumulate
information about the degree to which the test results generalize
to the domain (or item universe) of interest. In a D-study a
local decision maker may be interested only in the performance
of a specific group of examinees (a class) on a specific set of
items (a form of the test). Note, however, that the test developer
may also wish to conduct a D-study using all or part of the
information gathered in the G-study.

Once a test has been administered, it is possible to view 
the results in terms of a two facet analysis of variance problem 
where the facets are persons (p) and items (i). In this p-by-i
design, the score of person p on item i may be expressed as in
equation (6). By using analysis of variance procedures, it is
possible to obtain mean squares (MS) due to persons, items, and
the person-item interaction, which will be taken as the error
component. It can be shown (cf. Brennan, 1977a) that variance
components are directly estimable from mean squares. Specifically,

$$\hat{\sigma}^2(p) = \{MS(p) - MS(pi)\}/n_i, \tag{7}$$

$$\hat{\sigma}^2(i) = \{MS(i) - MS(pi)\}/n_p, \tag{8}$$

and

$$\hat{\sigma}^2(pi) = MS(pi), \tag{9}$$

where MS(p) is equal to the mean square for persons, MS(i) is
equal to the mean square for items, MS(pi) is equal to the mean
square for the person-by-item interaction; $n_i$ and $n_p$ are the
numbers of items and persons, respectively; and $\hat{\sigma}^2(p)$,
$\hat{\sigma}^2(i)$, and $\hat{\sigma}^2(pi)$ are the estimated G-study
variance components for persons, items, and the interaction term,
respectively.
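
To make the estimation concrete, the sketch below computes the three
mean squares for a complete persons-by-items score matrix and converts
them to variance components, taking the item component to be based on
the item mean square. The function name `g_study_components` and the
small 0/1 data set are hypothetical and serve only as an illustration;
the code assumes one score per person per item with no missing data.

```python
def g_study_components(scores):
    """Estimate G-study variance components for a persons-by-items design.

    scores: one list of item scores per person (complete matrix).
    Returns (var_p, var_i, var_pi).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for a two-way layout with one observation per cell.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i  # interaction confounded with error

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    var_p = (ms_p - ms_pi) / n_i   # persons
    var_i = (ms_i - ms_pi) / n_p   # items
    var_pi = ms_pi                 # interaction plus error
    return var_p, var_i, var_pi


# Hypothetical 0/1 data: four persons (rows) by three items (columns).
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
var_p, var_i, var_pi = g_study_components(scores)
print(round(var_p, 4), round(var_i, 4), round(var_pi, 4))
```

With real data a difference of mean squares can come out negative; how
to treat such estimates is a separate decision that this sketch does
not address.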

These estimates represent the variance components obtained 
in the simplest case; i.e., the person-by-item case. Far more 
complex cases are possible (and are treated by Brennan, 1977a) but 
need not be examined here. These variance component estimates
are quite helpful to the test consumer in terms of evaluating
various tests of similar content. In fact, the American
Psychological Association (APA), American Educational Research
Association (AERA), and National Council on Measurement in Education
(NCME) strongly suggest reporting G-study variance components along
with reliability data in technical manuals for published tests (APA,
1974).

D-study variance components may be derived directly from 
G-study components, once the testing model has been defined and 
a decision has been made as to how far one wants to generalize
results. Brennan (1977a) has devised a system to aid the decision
maker in specifying these parameters and deriving variance
components.

For the purpose of this paper, let us assume that we are
interested in being able to generalize over a potentially infinite
universe of items. In this case, the D-study variance components
may be expressed as follows:

$$\hat{\sigma}^2(p) = \hat{\sigma}^2(p), \tag{10}$$

$$\hat{\sigma}^2(I) = \hat{\sigma}^2(i)/n'_i, \tag{11}$$

and

$$\hat{\sigma}^2(pI) = \hat{\sigma}^2(pi)/n'_i. \tag{12}$$

In equation (10), the D-study variance component for persons
is equal to the G-study variance component for persons. This will
be the case when person is the unit of analysis (other possibilities
for unit of analysis include class, school, state, etc.). In
equations (11) and (12), the capital I denotes sampling across
items. The term $n'_i$ in equations (11) and (12) represents the
number of items in the particular test used in the D-study.

Given these D-study variance components, it is possible to
estimate two types of error for a given test. One is associated
with norm referenced testing situations and is denoted $\sigma^2(\delta)$.
The other is associated primarily with criterion referenced
testing and is denoted $\sigma^2(\Delta)$. Cronbach et al. (1972) indicate
that $\sigma^2(\delta)$ is appropriate for expressing error in terms of the
deviation from the population mean. $\sigma^2(\Delta)$ is, on the other hand,
appropriate for expressing error associated with the differences
between a given examinee's item universe score and observed
score. In terms of equations (11) and (12), we may operationally
define $\sigma^2(\delta)$ and $\sigma^2(\Delta)$ as follows:

$$\hat{\sigma}^2(\delta) = \hat{\sigma}^2(pI), \tag{13}$$

and

$$\hat{\sigma}^2(\Delta) = \hat{\sigma}^2(I) + \hat{\sigma}^2(pI), \tag{14}$$

where all terms are as defined above and in equations (11) and
(12).
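
A minimal sketch of these two error variances follows, assuming that
the D-study item and interaction components are the G-study components
divided by the number of items, that the norm referenced error is the
interaction component alone, and that the criterion referenced error
adds the item component. The function name `d_study_errors` and the
component values are hypothetical.

```python
def d_study_errors(var_i, var_pi, n_items):
    """Return (sigma2_delta, sigma2_Delta) for a test of n_items items.

    var_i, var_pi: G-study variance components for items and for the
    person-by-item interaction (confounded with error).
    """
    var_I = var_i / n_items        # D-study item component
    var_pI = var_pi / n_items      # D-study interaction component
    sigma2_delta = var_pI          # norm referenced error
    sigma2_Delta = var_I + var_pI  # criterion referenced error
    return sigma2_delta, sigma2_Delta


# Hypothetical G-study components, chosen only for illustration.
delta_3, Delta_3 = d_study_errors(var_i=0.03, var_pi=0.15, n_items=3)
delta_30, Delta_30 = d_study_errors(var_i=0.03, var_pi=0.15, n_items=30)
print(delta_3, Delta_3)
print(delta_30, Delta_30)
```

Because both components are divided by the number of items, lengthening
the test shrinks both error variances. The criterion referenced error is
never smaller than the norm referenced error, since it also carries the
item component.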

Cronbach et al. (1972) use the term $\sigma^2(\delta)$ in the calculation
of the generalizability index, $E\rho^2$, or the ratio of universe score
variance to expected observed score variance. This is essentially
the same coefficient as coefficient alpha (Cronbach, 1951) and KR-20
(Kuder & Richardson, 1937). It is traditionally taken as the
estimate of the internal consistency reliability of a test and
may be expressed as

$$E\hat{\rho}^2 = \frac{\hat{\sigma}^2(p)}{\hat{\sigma}^2(p) + \hat{\sigma}^2(pI)}, \tag{15}$$

where all terms are as defined above and in equations (10) and
(12).

Brennan & Kane (1977) used the error term $\sigma^2(\Delta)$ in developing
an index of dependability for criterion referenced tests or any
test which contains one or more cut-offs. Their index, called
M(C), may be expressed in terms of variance components as follows:

$$M(C) = \frac{\sigma^2(p) + (\mu - C)^2}{\sigma^2(p) + (\mu - C)^2 + \sigma^2(I) + \sigma^2(pI)}, \tag{16}$$

where $\mu$ is the population score mean, C is the cut-off score, and
other terms are as defined in equations (10) through (12). When
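
The two coefficients can be compared side by side in a short sketch. It assumes the D-study components are already available on the proportion-correct scale; the function names and the numbers are hypothetical.

```python
def generalizability(var_p, err_delta):
    """E(rho^2): universe score variance over expected observed score variance."""
    return var_p / (var_p + err_delta)


def dependability(var_p, err_Delta, mean, cutoff):
    """M(C) from variance components, with mean and cutoff as proportions correct."""
    signal = var_p + (mean - cutoff) ** 2
    return signal / (signal + err_Delta)


# Made-up components: persons .04, error variances .003 (delta) and .0034 (Delta).
e_rho2 = generalizability(0.04, 0.003)
m_at_mean = dependability(0.04, 0.0034, mean=0.70, cutoff=0.70)
m_far_from_mean = dependability(0.04, 0.0034, mean=0.70, cutoff=0.50)
```

In this sketch the cutoff appears only in `dependability`, so moving the cutoff changes M(C) but leaves the generalizability coefficient untouched.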
items are scored simply as correct/incorrect (or 1/0), Brennan
& Kane (1977) have shown that equation (16) may be estimated from
sample means and variances:

$$\hat{M}(C) = 1 - \frac{1}{n_i - 1} \left[ \frac{\bar{X}_{PI}(1 - \bar{X}_{PI}) - S^2(\bar{X}_{pI})}{(\bar{X}_{PI} - C)^2 + S^2(\bar{X}_{pI})} \right], \tag{17}$$

where $n_i$ is the number of items, $\bar{X}_{PI}$ is the sample mean over
items and examinees, $S^2(\bar{X}_{pI})$ is the sample variance of persons'
scores over items, and $\hat{M}(C)$ stands for the estimated value of M(C).
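
A minimal sketch of the estimate in equation (17), computed directly from a matrix of 1/0 item scores, is given below. The toy score matrix is invented for illustration, and the use of N - 1 as the divisor for the variance of person scores is an assumption, since the text does not say which divisor is intended.

```python
def estimate_m_c(scores, cutoff):
    """Estimate M(C) from 1/0 item scores, following equation (17).

    scores: one row per examinee, each row a list of 1/0 item scores.
    cutoff: the cutoff C expressed as a proportion correct (0 to 1).
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    person_means = [sum(row) / n_items for row in scores]
    grand_mean = sum(person_means) / n_persons
    # Variance of persons' proportion-correct scores (assumed N - 1 divisor).
    var_persons = sum((m - grand_mean) ** 2 for m in person_means) / (n_persons - 1)
    numerator = grand_mean * (1 - grand_mean) - var_persons
    denominator = (grand_mean - cutoff) ** 2 + var_persons
    return 1 - numerator / ((n_items - 1) * denominator)


# Toy data: four examinees, four items (hypothetical, not APL data).
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
at_mean = estimate_m_c(scores, cutoff=0.4375)   # cutoff equal to the sample mean
at_high_cutoff = estimate_m_c(scores, cutoff=0.9)
```

With these toy data the estimate is smaller when the cutoff sits at the sample mean than when it is moved toward the top of the scale.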

Finally, when the cut-off is equal to the sample mean ($C = \bar{X}_{PI}$),
Brennan (1977b) has shown that:

$$\hat{M}(C) = 1 - \frac{1}{n_i - 1} \left[ \frac{\bar{X}_{PI}(1 - \bar{X}_{PI}) - S^2(\bar{X}_{pI})}{S^2(\bar{X}_{pI})} \right], \tag{18}$$

where all terms are as defined in equation (17). This equation is
identical to the internal consistency estimate of tests derived by
Kuder & Richardson (1937) in their formula 21. This value is the
lowest possible value of $\hat{M}(C)$ for a given testing situation and
will be denoted KR-21 throughout the remainder of this paper. It
can also be shown that as the value of C approaches the maximum
or minimum possible score, $\hat{M}(C)$ will approach its maximum value,
and as C approaches $\bar{X}_{PI}$, $\hat{M}(C)$ approaches KR-21. Implications
for the setting of cut-offs are discussed in the following example.
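
The relationship to KR-21 can be checked numerically. The sketch below evaluates equation (17) from summary statistics and compares the value at the sample mean with the textbook KR-21 formula written in the number-correct metric. The mean, variance, and test length are hypothetical.

```python
def m_c_hat(mean, var, n_items, cutoff):
    """Equation (17) from summary statistics on the proportion-correct scale."""
    numerator = mean * (1 - mean) - var
    denominator = (mean - cutoff) ** 2 + var
    return 1 - numerator / ((n_items - 1) * denominator)


def kr21_raw(total_mean, total_var, n_items):
    """Textbook KR-21 using the mean and variance of number-correct scores."""
    return (n_items / (n_items - 1)) * (
        1 - total_mean * (n_items - total_mean) / (n_items * total_var)
    )


# Hypothetical test: 50 items, mean proportion correct .70, variance .04.
n_items, mean, var = 50, 0.70, 0.04
at_mean = m_c_hat(mean, var, n_items, cutoff=mean)
# Convert to the number-correct metric: mean scales by n, variance by n squared.
classic = kr21_raw(n_items * mean, n_items ** 2 * var, n_items)
# Evaluate the index at cutoffs 0.0, 0.1, ..., 1.0.
curve = [(c / 10, m_c_hat(mean, var, n_items, c / 10)) for c in range(11)]
lowest_cutoff, lowest_value = min(curve, key=lambda pair: pair[1])
```

Under these assumptions `at_mean` and `classic` agree, and the lowest point on `curve` falls at the cutoff equal to the mean.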

Data from the development of the APL Content Area Measures
(ACT, 1977a-f) are used here because of the relevance of the
APL program to functional competency and because generalizability/
dependability procedures were used in their development. Data
were collected in the spring (April) of 1977 from a total of
4,563 adult education students representing a cross section of
four regions and five different community sizes in the United
States. Inasmuch as there were five Content Area Measures, each
adult education student responded to items in only one content
area. Table 1 shows the number of items in each Content Area
Measure (CAM) and the number of examinees associated with the
development of each CAM.

Table 1

Numbers of Items and Examinees Associated with each Content Area Measure

Content Area Measure      Items   Examinees

Community Resources         51        855
Occupational Knowledge      42        866
Consumer Economics          66      1,148
Health                      45        841
Government and Law          45        853

Variance components for each test were estimated through multiple
matrix sampling procedures (Shoemaker, 1975). These variance
components were then used to obtain values of $\sigma^2(\delta)$, $\sigma^2(\Delta)$, $E\rho^2$,
and M(C). Since each CAM has, in effect, two cut-offs, two values
of M(C) were calculated for each test. In addition, other values
of M(C) were obtained for a range of cut-offs, including the sample
mean. Table 2 reports these estimates by CAM. Note that KR-21
refers to the value of M(C) where $C = \bar{X}_{PI}$. M(C1) refers to the
lower cut-off, while M(C2) refers to the upper cut-off, i.e., that
which separates Level 2 from Level 3.

Table 2

Error Components, Generalizability, and Dependability of Total Scores
on APL Content Area Measures for Adult Education Students

Content Area Measure     $\sigma^2(\delta)$   $E\rho^2$   $\sigma^2(\Delta)$   KR-21   M(C1)   M(C2)

Community Resources       .00257    .94    .00304    .93    .97    .93
Occupational Knowledge    .00343    .92    .00388    .90    .95    .91
Consumer Economics        .00208    .94    .00271    .92    .96    .93
Health                    .00349    .91    .00387    .90    .95    .91
Government and Law        .00364    .89    .00445    .87    .92    .91

As Table 2 shows, values of $E\rho^2$ are fairly high, ranging from
.89 for Government and Law to .94 for Community Resources and
Consumer Economics. Also, the values of M(C1) are higher than the
values of M(C2). This reflects the
fact that the sample means for each CAM were closer to the upper
cutoff. In every case, the lower cutoff was set at 51% correct,
and the upper cutoff was set at 76% correct. The sample means were
74% correct for Community Resources, 73% for Occupational Knowledge,
70% for Consumer Economics, 71% for Health, and 65% correct for the
Government and Law CAM. In the case of Government and Law, values
of M(C) differ by only .01. The mean score for the Government and
Law CAM (65%) falls close to halfway between 51% and 76%; thus,
the magnitudes of ($\bar{X}_{PI} - C$) are very similar for the two cutoffs.

The publishers of the APL Content Area Measures suggest that
local decision makers may wish to modify cutoffs to suit local
needs. Altering the cutoff, however, will result in a change
in the dependability of the measures. The values listed in Table 2
under KR-21 represent the lowest possible values of M(C) for the
data used in the development of the CAMs. It is also possible to
set cutoffs in such a way as to increase the value of M(C). Figures
4 through 8 demonstrate the results of raising or lowering the
value of C.

As can be seen in Figures 4 through 8, the generalizability
coefficient $E\rho^2$ is totally unaffected by the value of the cutoff C.
In other words, the position of the cutoff has no bearing on the
ability of the test to rank order people. Note also that the
lowest value of M(C) is always below the $E\rho^2$ line. This is because
the coefficient M(C) incorporates the variance due to item sampling
in its definition of error, whereas $E\rho^2$ does not. Thus, by
incorporating item variance in order to make absolute evaluations
more meaningful, M(C) becomes a more conservative estimate of the
precision of the test than $E\rho^2$.
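
As a rough check on this point, the sketch below uses the Community Resources row of Table 2. The person variance is not reported in the table, so it is backed out here from the rounded generalizability coefficient and the reported norm referenced error variance; that step, the rounded sample mean of .74, and the function name are assumptions made for illustration.

```python
def dependability(var_p, err_Delta, mean, cutoff):
    """M(C) from variance components on the proportion-correct scale."""
    signal = var_p + (mean - cutoff) ** 2
    return signal / (signal + err_Delta)


# Community Resources values as reported in Table 2 and the text.
err_delta = 0.00257   # norm referenced error variance
err_Delta = 0.00304   # criterion referenced error variance
e_rho2 = 0.94         # generalizability coefficient (rounded)
mean = 0.74           # sample mean, proportion correct (rounded)

# Assumption: recover person variance from E(rho^2) = var_p / (var_p + err_delta).
var_p = e_rho2 * err_delta / (1 - e_rho2)

m_lower = dependability(var_p, err_Delta, mean, cutoff=0.51)  # lower cutoff
m_mean = dependability(var_p, err_Delta, mean, cutoff=mean)   # cutoff at the mean
m_upper = dependability(var_p, err_Delta, mean, cutoff=0.76)  # upper cutoff
```

To two decimals these come out to .97, .93, and .93, in line with the M(C1), KR-21, and M(C2) entries for Community Resources. The value at the mean stays below .94 because the item variance is counted as error in M(C) but not in the generalizability coefficient.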

Again, in reference to Figures 4 through 8, the values of
M(C) increase rather slowly for Community Resources (Figure 4)
and Consumer Economics (Figure 6) as C moves away from the sample
mean. M(C) increases quite dramatically for Occupational Knowledge
(Figure 5), Health (Figure 7) and Government and Law (Figure 8).
These differences in slope reflect differences in the relative size
of $\sigma^2(\Delta)$, or error variance, associated with each CAM. This is not
to say that these three CAMs are inherently error prone but rather
that as the cutoff moves from the extremes to the mean, the
dependability of the testing procedure declines more rapidly than
it does in the Community Resources and Consumer Economics CAMs.
In each CAM, the value of M(C) is nearly 1.0 when the cutoff is
set at 0 or 100% (trivial and highly unlikely cutoffs). Furthermore,
respectable values of M(C) are maintained throughout the
entire range of possible cutoff scores for each CAM.

Figure 4. Generalizability/Dependability Coefficient as a Function
of Cutoff: Community Resources Content Area Measure.

Figure 5. Generalizability/Dependability Coefficient as a Function
of Cutoff: Occupational Knowledge Content Area Measure
($\bar{X}_{PI} = .7279$, $\sigma^2(\Delta) = .0039$).

Figure 6. Generalizability/Dependability Coefficient as a Function
of Cutoff: Consumer Economics Content Area Measure ($n_i = 66$).

Figure 7. Generalizability/Dependability Coefficient as a Function
of Cutoff: Health Content Area Measure.

Figure 8. Generalizability/Dependability Coefficient as a Function
of Cutoff: Government and Law Content Area Measure.

Implications and Problems 

Recalling now that the decision maker may raise or lower a
cutoff as information justifies such action, one can see that
there will be instances in which trade-offs between dependability
and validity may become necessary. Assume for a moment that the
cutoff score for Community Resources (Figure 4) that satisfied
the conditions of equation (2) had been .74, or about 38 items
correct. This would be the worst possible cutoff as far as
dependability is concerned. Similar situations may arise if one
uses Nedelsky's method, Ebel's, or any other content or criterion
validity related method of setting cutoffs.

For the Community Resources CAM, the value of M(C) where
C = 38 (74% correct) is .93. By either raising or lowering the
cutoff, the decision maker could increase the dependability of
the testing procedure. However, such action would also, in all
likelihood, alter the probabilities of misclassification with
respect to the external criterion.

In this particular case, the dilemma may not be very serious. 
The M(C) value of .93 is quite good. In other instances, it 
would be advisable for the decision maker to calculate or obtain 
values of KR-21 for the test to be used. If the value of KR-21
represents an acceptable level of M(C), then any value of C
obtained through any cutoff setting procedure would be satisfactory.

Now suppose that for a given test the obtained value of
KR-21 does not represent an acceptable level of dependability. 
This does not automatically mean that the test must be ruled 
out as an aid in making decisions about learners. Instead, 
this low value will limit the range of C. Should the value of 
C derived by equations (1) or (2) or any other procedure fall 
outside this restricted range, then adjustments are called for. 

It might seem logical in such instances to ignore dependa- 
bility indices and allow validity information alone to govern 
the setting of cutoffs. However, recall that a low value of
M(C) (including KR-21) indicates a great deal of item variability
relative to person variability. The model described in equation
(2) does not allow for much item variability. Therefore, to the
extent that item variability is large relative to person variability,
the cutoff derived through equation (2) will be somewhat tenuous.
For strictly content oriented models, item variability may also
be a problem, depending on how narrowly one defined the domain
of interest. The seriousness of this problem, given content
oriented models, is not as obvious as in Emrick's (1971) model.

Another way to deal with the validity/dependability dilemma
is to increase test length. Note in equation (17) that as $n_i$
increases, M(C) will approach 1.0. If a value of C obtained
through some procedure were to be inserted into equation (17) and
a minimum acceptable value of M(C) were set, then it would be
possible to solve for $n_i$, the number of items needed to test
at the desired cutoff and level of dependability. For locally
produced tests this solution may be relatively easy to implement.
If, however, the decision maker is relying on standardized
products, such a solution may be less appealing.
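
Solving equation (17) for the number of items can be sketched as follows. The sketch treats the sample mean and the variance of proportion-correct scores as fixed while the test is lengthened, which is a simplifying assumption rather than something the text establishes. The function name and the numbers are hypothetical.

```python
import math


def items_needed(mean, var, cutoff, target):
    """Smallest number of items for which equation (17) reaches the target M(C).

    mean, var: sample mean and variance of proportion-correct scores,
    assumed here to stay fixed as items are added.
    cutoff: cutoff C as a proportion correct; target: minimum acceptable M(C).
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    numerator = mean * (1 - mean) - var
    denominator = (mean - cutoff) ** 2 + var
    if numerator <= 0:
        return 2  # equation (17) is already at or above 1 for any usable length
    # Rearranged from 1 - numerator / ((n - 1) * denominator) >= target.
    return math.ceil(1 + numerator / ((1 - target) * denominator))


# Hypothetical case: mean .70, variance .04, cutoff at the mean, target .96.
n_required = items_needed(mean=0.70, var=0.04, cutoff=0.70, target=0.96)
```

Placing the cutoff at the mean is the least favorable case, so any other cutoff would call for the same number of items or fewer under these assumptions.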

For tests such as the APL measures, where two or more
cutoffs are suggested, a different kind of problem is possible.
It may turn out that data do not support a three group
interpretation of test scores. In some instances, it may be more
appropriate simply to classify learners into one of two
categories, rather than into one of three or more categories. For
example, adult education students who scored in the Average or
Above Average range on some APL Survey (ACT, 1976) subtests may
in some instructional settings be treated as similar to each
other but collectively different from those who scored in the
Below Average Range. A comparison of group score means would
reveal whether or not such a strategy would be advisable.
Cutoffs would then be adjusted accordingly.

Whatever the course taken in dealing with dependability/ 
validity data, the crucial point is that somewhere in the process, 
the learner must derive some benefit over and above that which 
might be derived through random or arbitrary assignment. The 
benefit that will accrue to the learner will be a function of 
the correct classification of learner competencies and subsequent 
instruction. Within adult basic education, this focus on
classification and instruction of individuals is seen as highly
appropriate. Methods of assessing functional competency should
be and generally are likewise individually oriented. Brennan and
Kane (Brennan, 1977a, 1977b; Brennan & Kane, 1977, in press;
Kane & Brennan, 1977) have devised a frame of reference for
expressing the dependability of such assessments. Examples
drawn from the development of the APL Survey and Content Area
Measures have been provided to demonstrate the usefulness of
this frame of reference as well as of data obtained from non-test
sources.

A systematic procedure has been described whereby the adult
educator may make judgements not only about adult learners but
about tests of functional competency as well. Definition, content
validity, criterion validity, and dependability as previously
described all play important roles in the execution of this
procedure.

NOTES

1. Conference on Strategies for Generating a National
"Right to Read" Adult Movement, Raleigh, North Carolina, January,
1970.

2. Ronald M. Cervero, The Adult Performance Level Test:
A measure of "functional competence"? Unpublished manuscript,
University of Chicago, 1978.

3. This point is forcefully made by Gene Glass in "Standards
and Criteria", a paper presented at the Seventh Annual Conference
on Educational Assessment, 1977 and in "Postscript to 'Standards
and criteria,'" a paper presented at the 1977-78 Winter conference
on Measurement and Methodology of the Center for the Study of
Evaluation, University of California - Los Angeles, January, 1978.

4. Jerry K. Williams, The APL: A minimal competency skills
program. A presentation to the National Assessment of Educational
Progress, Boulder, Colorado, June 14, 1977.